Modelling global mammal diversity in a climate change scenario.

2016-12-22

This is more a work report than a scientific study; its aim is to make the analysis, methods and results reproducible.

Abstract

The prediction indicates a general decline of mammal diversity with regional differentiation. The strongest decline is found in the biodiversity hotspots of Central Africa and South America, while in northern Australia, Indonesia and Madagascar the number of species increases. The compared predictive models vary in performance. The best model in terms of RMSE is ranger, followed by the neural net, glmnet and finally the GLM.

Introduction

The objective is to model the occurrence of mammal species in a climate change scenario and to make predictions for the period 2050-2079. The report is also used to compare and evaluate the performance of the different statistical models. The general approach is to fit my models to data, evaluate their performance and make predictions. I predict the occurrence of mammals as a species richness index, i.e. the number of species in a given raster cell of 1° * 1° (x, y). This requires some preprocessing of the data. The report is therefore organised as follows:

  1. Prepare data
  2. Build models
  3. Evaluate performance
  4. Make predictions
  5. Discuss results

Prepare Data

The data can be divided into three categories: species distribution data, climate data and land use data. The present species distribution data of 5265 mammal species is available as shapefiles (IUCN Red List of Threatened Species). The climate data is generated by a simulation of the IPSL-CM5A model for the RCP6.0 scenario of the IPCC. The land use data is produced by LPJ-GUESS. The simulation was performed by Jörg Steinkamp (Senckenberg, Frankfurt, AG Hickler). The climate data consists of the same four variables, once as measurements and once as predictions:

The measurements are taken as monthly averages for the period 1971-2000. The prediction is based on the IPCC RCP6.0 scenario for the period 2050-2079, again as monthly averages. The land use data consists of two data sets in .csv format - one for the period 1971-2000 and one predicted for the period 2050-2079. Land use is divided into 20 categories with a resolution of 0.5° * 0.5°.


First, I need to convert the data into the right format. My analysis is based on a grid with a resolution of 1° * 1°, therefore all data must refer to this grid and needs to be converted.

library(raster)
# create one-degree grid with CRS
grid1 <- raster(nrows=180, ncols=360, xmn=-180, xmx=180, ymn=-90, ymx=90, crs=CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0"))

# save coordinates of grid1 as data.frame
coords1 <- as.data.frame(coordinates(grid1))

# save as .Rdata
save(grid1, coords1, file = "grid.Rdata")

# remove files to save RAM
rm(grid1, coords1)

Species distribution data

I convert the grid from raster to polygons, so each cell is a polygon. Then I use over() from the sp package with the polygons of each species and the grid polygons to calculate the distribution shape area of each species in each cell. Next, the species are simply counted. This method is rather inconvenient and time-consuming, but I failed to find an easier solution. Because it takes too long, I simply load the already processed data and plot an example of the raw data for Vulpes vulpes in Figure 1.
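The counting step itself reduces to a row-wise tally over an overlap matrix. A minimal base-R sketch with assumed toy values (rows are grid cells, columns are species, entries the overlapped shape area):

```r
# toy overlap matrix (assumed values); NA means no overlap
m <- matrix(c(0.5,  NA, 1.2,
               NA,  NA, 0.3,
              2.0, 0.1,  NA), nrow = 3, byrow = TRUE)

# a species counts for a cell if it has any positive overlap area there
richness <- rowSums(m > 0, na.rm = TRUE)
richness
# [1] 2 1 2
```

This is exactly the logic applied below to CountPerCellAllSpecies.mat.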

# plot raw data
library(maptools)
mams <- readShapeSpatial("TERRESTRIAL_MAMMALS/TERRESTRIAL_MAMMALS")

#add geographic coordinate system and projection  
mams@proj4string <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# make a plot with the most widespread species to illustrate (Vulpes vulpes, red fox)
# subset 
vulpes <- mams[grep(mams@data$binomial[which.max(mams@data$shape_Area)], mams@data$binomial),]

# map with tmap
library("tmap")
vulpes_map <- tm_shape(vulpes, projection = "+proj=longlat +ellps=WGS84  +datum=WGS84 +no_defs +towgs84=0,0,0") +
  tm_polygons() 

tmap_leaflet(vulpes_map)

Figure 1 - Distribution of Vulpes vulpes (red fox)

# clear workspace
rm(list = ls())

# skip process if data already in the directory and read index data.frame
if ("speciesRichness1.Rdata" %in% dir()) {
  load("speciesRichness1.Rdata") # load data if already processed
} else {

load("grid.Rdata")
#read in shapefiles
library(maptools)
mams <- readShapeSpatial("TERRESTRIAL_MAMMALS/TERRESTRIAL_MAMMALS")

#add geographic coordinate system and projection  
mams@proj4string <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# save coordinates of grid1 as data.frame
coords1 <- as.data.frame(coordinates(grid1))

#convert grid to polygon
grid1polygon <- rasterToPolygons(grid1)
# str(grid1polygon, max.level = 2)

# extract unique species names for indexing
speciesNames <- sort(unique(mams@data$binomial)) # 5265 levels

# create the empty matrix, one column refers to one cell of the reference grid
CountPerCellAllSpecies.mat <- matrix(NA, ncol=length(speciesNames), nrow=length(grid1polygon$layer))

# locate indexes where grid and species data overlay
# extract polygons over the grid
for (i in 1:length(speciesNames)){
   test <- over(grid1polygon, mams[which(mams@data$binomial == speciesNames[i]), ])$shape_Area 
  CountPerCellAllSpecies.mat[,i] <- test
  rm(test)
  # print(i)
}

# save 
save(CountPerCellAllSpecies.mat, file="CountPerCellAllSpecies.mat100x100.Rdata")

# count species per cell (any positive overlap area counts)
SD <- rowSums(CountPerCellAllSpecies.mat > 0, na.rm = T)

# remove file to save RAM
rm(CountPerCellAllSpecies.mat)

# compute species richness index and add coordinates
speciesRichness1 <- cbind(coords1, SD)

# SD == 0, set NA
speciesRichness1$SD[speciesRichness1$SD == 0] = NA
# save large file to save RAM and processing time 
save(speciesRichness1, file="speciesRichness1.Rdata")
rm(list = ls())
}
load("speciesRichness1.Rdata")
# convert data.frame to spatial Raster
# count NAs
# sum(is.na(speciesRichness1$SD))
# 39868 NAs caused by the  sea
library(raster)
# create raster
speciesRichness1_raster <-rasterFromXYZ(speciesRichness1, crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0") 

library(rworldmap)
# get world map with country borders
sPDF <- getMap()  

library(RColorBrewer)
#finding color palette
# display.brewer.all()
# take "Greens"
greens <- brewer.pal(5,"Greens")

# tmap
library("tmap")
sd_map <- tm_shape(speciesRichness1_raster, projection = "+proj=longlat +ellps=WGS84  +datum=WGS84 +no_defs +towgs84=0,0,0") +
  tm_raster(palette = greens, title = "species per cell") +
  tm_legend(position = c("left", "bottom")) +
  tm_shape(sPDF) +
  tm_borders(alpha = 0.6) 

# convert to an interactive map by using leaflet package and base map
lf <- tmap_leaflet(sd_map)
lf

Figure 2 - Number of species per cell

# save ram
 rm(list = ls())

Figure 2 shows the species richness index as an interactive map. The map shows the number of species per cell and displays the typical biodiversity hotspots around the equator. The number of species per cell ranges between 0 and 250.

Landuse data

The land use data needs to be projected to a coarser resolution (from 0.5° * 0.5° to 1° * 1°) and a larger extent (from latitude -56° - 83.5° to -90° - 90°). To do this I simply project the land use raster to the grid raster with projectRaster() from the raster package. This projection is done for every raster layer and the resulting values are all saved in one data frame. This step is performed twice, once for the past data set and once for the future data set.

library(raster)  
load("speciesRichness1.Rdata")  
load("grid.Rdata")
# load data
LU_past <- read.csv2("LandUse_1970-1999.csv")

# save to later fill up missing variables in future dataset
save(LU_past, file = "LU_past.Rdata")

# add spatial reference, make a raster
LU_past_r <- rasterFromXYZ(LU_past, crs=CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0"))

# different extent
grid1@extent
class       : Extent 
xmin        : -180 
xmax        : 180 
ymin        : -90 
ymax        : 90 
LU_past_r@extent
class       : Extent 
xmin        : -180 
xmax        : 180 
ymin        : -56 
ymax        : 83.5 
# project land use raster to grid raster and extract values of each cell for every land use
SR_LU_past <- speciesRichness1

# select names
land_use_names <- names(LU_past_r)

for (i in 1:length(land_use_names)) {
  land_use_pro <- projectRaster(LU_past_r[[i]], grid1)
  SR_LU_past <- cbind(SR_LU_past, values(land_use_pro))
  names(SR_LU_past)[3+i] <- names(LU_past_r[[i]]) 
  # print(i)
  }
  
save(SR_LU_past, file = "SR_LU_past.Rdata")
rm(list = ls())

In the future data set two variables are missing (Boreal.Deciduous.Forest.Woodland, Temperate.Boreal.Mixed.Forest). I simply fill in the missing variables with those from the past data set.

# load necessary data  
load("speciesRichness1.Rdata")
load("grid.Rdata")
load("SR_LU_past.Rdata")
load("LU_past.Rdata")

# load land use data
LU_future <- read.csv2("LandUse_2050-2079.csv")

# exclude  Smith2014, Smith2014_, primary, X.1, X
LU_future <- LU_future[,-c(3,4, 5, 24,25)]

# problem !!!
# variables in past_dataset but not in future dataset
names(LU_past)[!(names(LU_past) %in% names(LU_future))]
[1] "Boreal.Deciduous.Forest.Woodland" "Temperate.Boreal.Mixed.Forest"   
# missing "Boreal.Deciduous.Forest.Woodland" and "Temperate.Boreal.Mixed.Forest"; reuse the variables of the past dataset in the future dataset.
# variables in future_dataset but not in past dataset
names(LU_future)[!(names(LU_future) %in% names(LU_past))]
character(0)
# all variables included

# fill up future dataset with the missing variables
library(dplyr)

# make tbl
tbl_LU_past <- tbl_df(LU_past)
missing <- dplyr::select(tbl_LU_past, Boreal.Deciduous.Forest.Woodland, Temperate.Boreal.Mixed.Forest)

# bind cols together
LU_future <- bind_cols(tbl_df(LU_future), missing)

# apply same order on future dataset
LU_future <- LU_future[,colnames(LU_past)]

# add spatial reference
LU_future_r <- rasterFromXYZ(LU_future, crs=CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0"))

# project land use raster to grid raster and extract values of each cell for every land use
# for the future, without SD
SR_LU_future <- speciesRichness1

land_use_names <- names(LU_future_r)

for (i in 1:length(land_use_names)) {
  land_use_pro <- projectRaster(LU_future_r[[i]], grid1)
  SR_LU_future <- cbind(SR_LU_future, values(land_use_pro))
  names(SR_LU_future)[3+i] <- names(LU_future_r[[i]]) 
  # print(i)
}

# clean memory

save(SR_LU_future, file="SR_LU_future.Rdata")
rm(list = ls())

Climate data

The climate variables are converted to bioclim variables, which have the advantage of not being tied to a specific hemisphere. The conversion is done with the bioclim function of the climates package. Next, the resolution and extent need to be adjusted. I convert the data to raster and use the extent and resolution of my reference grid to rasterize the data again (aggregated by mean). Again, this needs to be done for both the past and the future data set.

# install.packages("climdex.pcic") # dependence of climates
# install.packages("climates", repo='http://www.rforge.net/', depend=T)

library("SDMTools")
library("chron")
library("zoo")
library("climates")

load("grid.Rdata")

# load data
Thistoric <- read.table("tas_ipsl-cm5a-lr_hist_1971-2000.csv", header=T)
Phistoric <- read.table("pr_ipsl-cm5a-lr_hist_1971-2000.csv", header=T)
Tminhistoric <- read.table("tasmin_ipsl-cm5a-lr_hist_1971-2000.csv", header=T)
Tmaxhistoric <- read.table("tasmax_ipsl-cm5a-lr_hist_1971-2000.csv", header=T)
coords_climate <- Thistoric[,1:2]

# find extent
library(raster)
Thistoric_points <- rasterFromXYZ(Thistoric)

# now convert to BIOCLIM predictors exclude the coordinates from conversion

biovarsHistoric <- bioclim(tmin=Tminhistoric[,-c(1,2)], tmax=Tmaxhistoric[,-c(1,2)], prec=Phistoric[,-c(1,2)], tmean=Thistoric[,-c(1,2)])

# summary(biovarsHistoric) 
# some NAs in var 15 (Precipitation Seasonality)

# kick out variable 3 (Isothermality 2/7) and 7 (Temperature Annual Range (5-6)), which are (non-)linear combinations of the others:

BiovarsHistoric <- biovarsHistoric[,-c(3, 7)]
# remove raw data for more memory
rm(biovarsHistoric, Thistoric, Phistoric, Tminhistoric, Tmaxhistoric)

# add coordinates to bioclim variables 

# put land use and climate together
# convert bioclim variables into spatial points to later rasterize them onto the 1-degree grid

# data.frame to bind all values to
clim.data_past <- cbind.data.frame(coords1)

# calculate bioclim variable for each cell
for (i in 1:length(colnames(BiovarsHistoric))) {
  # convert to points
  bioclim_points <- SpatialPointsDataFrame(coords_climate, as.data.frame(BiovarsHistoric[,i]))
  # add CRS
  bioclim_points@proj4string <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")
  # rasterize points in grid
  bioclim_raster <- rasterize(bioclim_points, grid1, fun = mean,na.rm = T)
  #add values of raster in data.frame
  clim.data_past <- cbind.data.frame(clim.data_past,values(bioclim_raster[[2]]))
  #add names to data.frame
  names(clim.data_past)[2+i] <- colnames(BiovarsHistoric)[i] 
  # print(i)
}

save(clim.data_past, file = "clim.data_past.Rdata")
rm(list = ls())
load("grid.Rdata")

# load data
Thistoric <- read.table("tas_ipsl-cm5a-lr_rcp6p0_2050-2079.csv", header=T)
Phistoric <- read.table("pr_ipsl-cm5a-lr_rcp6p0_2050-2079.csv", header=T)
Tminhistoric <- read.table("tasmin_ipsl-cm5a-lr_rcp6p0_2050-2079.csv", header=T)
Tmaxhistoric <- read.table("tasmax_ipsl-cm5a-lr_rcp6p0_2050-2079.csv", header=T)
coords_climate <- Thistoric[,1:2]

# exclude the coordinates from conversion

biovarsFuture <- bioclim(tmin=Tminhistoric[,-c(1,2)], tmax=Tmaxhistoric[,-c(1,2)], prec=Phistoric[,-c(1,2)], tmean=Thistoric[,-c(1,2)])

# summary(biovarsFuture) 
# some NAs in var 15 (Precipitation Seasonality)

# kick out variable 3 (Isothermality 2/7) and 7 (Temperature Annual Range (5-6)), which are (non-)linear combinations of the others:

BiovarsFuture <- biovarsFuture[,-c(3, 7)]
# remove raw data for more memory
rm(biovarsFuture, Thistoric, Phistoric, Tminhistoric, Tmaxhistoric)

# add coordinates to bioclim variables, future

# put land use and climate together
# convert bioclim variables into spatial points to later rasterize them onto the 1-degree grid
load("grid.Rdata")

# data.frame to bind all values to
clim.data_future <- cbind.data.frame(coords1)

# calculate bioclim variable for each cell
for (i in 1:length(colnames(BiovarsFuture))) {
  # convert to points
  bioclim_points <- SpatialPointsDataFrame(coords_climate, as.data.frame(BiovarsFuture[,i]))
  # add CRS
  bioclim_points@proj4string <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")
  # rasterize points in grid
  bioclim_raster <- rasterize(bioclim_points, grid1, fun = mean,na.rm = T)
  #add values of raster in data.frame
  clim.data_future <- cbind.data.frame(clim.data_future,values(bioclim_raster[[2]]))
  #add names to data.frame
  names(clim.data_future)[2+i] <- colnames(BiovarsFuture)[i] 
  # print(i)
}

save(clim.data_future, file = "clim_future.Rdata")
rm(list = ls())

The last step of the data preparation is to bring all the data together. I combine all variables for the past as one predictor data frame for training the models and all future variables as one prediction data frame to make predictions.

library(dplyr)
load("SR_LU_past.Rdata")
load("SR_LU_future.Rdata")
load("clim.data_past.Rdata")
load("clim_future.Rdata")

# join data for past dataset
all.data_past <- left_join(SR_LU_past, clim.data_past, by = c("x","y"))

# join data for future dataset
all.data_future <- left_join(SR_LU_future, clim.data_future, by = c("x","y"))

# save files
save(all.data_past, file="all.data_past.Rdata")
save(all.data_future, file = "all.data_future.Rdata")

# free memory
rm(list = ls())

Spatial blocks for cross-validation index

To cope with spatial autocorrelation in my data I use spatial blocks for cross-validation. Because of the global nature of the data, spatial autocorrelation is likely: near sites are more similar than distant sites. This holds for spatial data in general, and global data additionally shows strong regional particularities. I fit my models with 10-fold cross-validation in the training phase to minimise overfitting. By default, the holdouts are drawn randomly, which assumes that observations are independent. That is not true for spatial data and leads to overfitting and overconfident models! Therefore, I take spatially contiguous holdouts to account for the spatial autocorrelation and prevent my models from overfitting. I do this by dividing the indices into 10 spatial blocks. This list of indices is later used to train the models. Figure 3 presents the spatial split of the indices.
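The aggregate/disaggregate trick in the code below can be illustrated with plain base R: draw one fold id per coarse block and replicate it down to the fine cells (a toy 4 x 4 grid with 2 x 2 blocks; grid size and seed are arbitrary):

```r
set.seed(7)
# one random fold id per coarse 2 x 2 block
fold_block <- matrix(sample(1:4), nrow = 2)

# replicate each block id down to the fine cells (disaggregation by hand)
fold_cell <- fold_block[rep(1:2, each = 2), rep(1:2, each = 2)]

fold_cell # neighbouring fine cells share a fold, so holdouts are spatially contiguous
```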

library(raster)

# load all data
load("all.data_past.Rdata")
load("grid.Rdata")

# aggregate with a factor of ~ 10
grid10 <- aggregate(grid1, 10)

# replace values of new layer with random numbers of 1:10
values(grid10) <- sample(1:10, ncell(grid10), replace = T)

# disaggregate blocks
grid1_SB <-disaggregate(grid10, 10)

# visualize
library(tmap)
data(World)

tm_shape(World) +
  tm_borders() +
  tm_shape(grid1_SB) +
  tm_raster(alpha = 0.4, palette = "cat", legend.show = F)

Figure 3 - Spatial blocks for cross-validation

# get coordinates of spatial blocks 
coords1_SB <- as.data.frame(coordinates(grid1_SB))

# add spatial block index values to coordinates
coords1_SB$Values <- values(grid1_SB)

library(dplyr)
# join values of spatial blocks to all.data. set join by x, y
all.data_past_SB <- left_join(all.data_past, coords1_SB, by = c("x","y"))

# exclude coordinates from analysis, this is the training dataset
training_set_SB <- all.data_past_SB

# add values of spatial blocks to training dataset
training_set_SB$cvindex <- coords1_SB$Values

# remove all NA from dataset
training_set_SB <- na.omit(training_set_SB)

# create new spatial block indices for the index argument in trainControl
index10 <- list("site1", "site2","site3","site4","site5","site6","site7","site8","site9","site10")

# loop to add training indices to index object (list of integer vectors)
for (i in 1:10){
index10[[i]] <- as.integer(which(training_set_SB$cvindex != i))
}

# save relevant data for further analysis
save(training_set_SB, index10, file = "training_set_SB.Rdata")

# clear workspace
rm(list = ls())

Model building

For all my predictive models I use the caret package because of its broad applicability and convenient usability. The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models in R. I use:

Before I start training the models I exclude highly correlated (> 0.75) predictors from the training set.
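As a toy base-R illustration of the idea behind findCorrelation() (the data and the names a, b, c are assumptions): b is nearly a copy of a, so the pair exceeds the 0.75 cutoff and one member should be dropped.

```r
set.seed(42)
a  <- rnorm(100)
df <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))

cm <- abs(cor(df))
diag(cm) <- 0                                     # ignore self-correlation
flagged <- colnames(df)[apply(cm, 2, max) > 0.75]
flagged                                           # both members of the correlated pair
# caret's findCorrelation() additionally decides which one of the two to remove
```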

library(caret)
# load data for analysis
load("training_set_SB.Rdata")

# find correlated predictors

# "corr" in preProcess gives the error "subscript out of bounds", therefore I apply it before creating the train object.
high_cor <- cor(training_set_SB[,-c(1,2,41,42)])

# set cutoff
high_cor_cut <- findCorrelation(high_cor, cutoff = .75)

# selected predictors
names(training_set_SB)[high_cor_cut]
 [1] "bioclim_9"                "bioclim_4"               
 [3] "Tropical.Seasonal.Forest" "bioclim_1"               
 [5] "bioclim_6"                "bioclim_8"               
 [7] "bioclim_10"               "bioclim_2"               
 [9] "bioclim_14"               "bioclim_15"              
# kick out high correlated variables
pre_training_set<- training_set_SB[, -high_cor_cut]

# save training data
save(pre_training_set, index10, file = "training_data.Rdata")

# free memory
rm(list = ls())

Fitting regressions

GLM

The first GLM serves as a test and reference model; its expected performance is low.
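For reference, the RMSE reported for every model below is the root mean squared error over the cross-validation holdouts, with predicted richness \(\hat{y}_i\) and observed richness \(y_i\) of cell \(i\):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
```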

# loading the already processed models to save time
load("models.Rdata")
# fitting a glmnet
library(caret)

#fitting glm
model_glm <- train(
  SD ~ ., 
  data = pre_training_set[,-c(1,2,31,32)],
  method = "glm",
  na.action = na.omit,
  preProcess = c("center", "scale", "YeoJohnson", "nzv"),
  trControl = trainControl(
    method = "cv",
    verboseIter = F,
    number = 10,
    index = index10,
    allowParallel = F)
)
model_glm
Generalized Linear Model 

15421 samples
   27 predictors

Pre-processing: centered (16), scaled (16), Yeo-Johnson
 transformation (16), remove (11) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 13859, 14109, 13964, 14038, 13764, 13653, ... 
Resampling results:

  RMSE      Rsquared 
  27.24268  0.6155897

 
# variable importance
imp_var_glm <- varImp(model_glm)

ggplot(imp_var_glm, mapping = NULL,
  top = 5, environment = NULL)

Figure 4 - Importance of predictors glm

# prediction
glm_predict <- predict(model_glm, newdata = pre_training_set[,-c(1,2,31,32)])

# add to dataset
fit_project <- as.data.frame(glm_predict)

#RMSE
round(model_glm$results$RMSE, 2)
[1] 27.24

The GLM has an RMSE of 27.24. The most important predictor is “Tropical.Rain.Forest”, see Figure 4. The prominence of land use predictors is striking.

GLMnet

The glmnet model attempts to find a simple model by combining lasso and ridge regression:

  • lasso regression - penalises the absolute size of the coefficients and can shrink some to exactly zero, reducing the number of predictors.
  • ridge regression - penalises large coefficients.

The model can fit a mix of both regressions by the tuning parameters:

  • alpha (0-1) - 1 is pure lasso, 0 is pure ridge regression; intermediate values mix both.
  • lambda (0-infinity) - size of the penalty.
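For orientation, in glmnet's parameterisation the two tuning parameters enter the least-squares objective as follows (with coefficients \(\beta\); \(\alpha = 1\) gives the pure lasso, \(\alpha = 0\) pure ridge):

```latex
\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  + \lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Bigr]
```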

By default, caret's train function tries 3 values each of alpha and lambda in the training phase and selects the combination with the lowest RMSE to train the final model. Because this model fits faster, I now also use all interactions of the predictors to fit the model.

# exclude x, y values and index from training data

model_glmnet <- train(
  SD ~ (.)^2,
  data = pre_training_set[,-c(1,2,31,32)],
  method = "glmnet",
  na.action = na.omit,
  preProcess = c("center", "scale", "YeoJohnson", "nzv"),
  trControl = trainControl(
    method = "cv",
    verboseIter = F,
    number = 10,
    index = index10,
    allowParallel = F)
)
model_glmnet
glmnet 

15421 samples
   27 predictors

Pre-processing: centered (121), scaled (121), Yeo-Johnson
 transformation (121), remove (257) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 13859, 14109, 13964, 14038, 13764, 13653, ... 
Resampling results across tuning parameters:

  alpha  lambda      RMSE      Rsquared 
  0.10   0.06654663  23.82367  0.6871388
  0.10   0.66546628  24.49496  0.6756355
  0.10   6.65466285  26.17280  0.6466604
  0.55   0.06654663  23.94082  0.6848139
  0.55   0.66546628  25.17480  0.6642050
  0.55   6.65466285  27.20560  0.6289958
  1.00   0.06654663  24.08079  0.6822378
  1.00   0.66546628  25.92833  0.6504005
  1.00   6.65466285  27.86125  0.6246701

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were alpha = 0.1 and lambda
 = 0.06654663. 
# variable importance
imp_var_glmnet <- varImp(model_glmnet)
ggplot(imp_var_glmnet, mapping = NULL,
  top = 5, environment = NULL)

Figure 5 - Importance of predictors glmnet

# prediction
glmnet_predict <- predict(model_glmnet, newdata = pre_training_set[,-c(1,2,31,32)])

# add to dataset
library(dplyr)

fit_project <- bind_cols(fit_project, as.data.frame(glmnet_predict))
min(model_glmnet$results$RMSE)
[1] 23.82367
#RMSE
round(min(model_glmnet$results$RMSE),2)
[1] 23.82

The RMSE of 23.82 is reached with the lowest alpha = 0.1 and the lowest lambda = 0.07. With alpha this close to 0 the fit is mainly ridge regression with a low penalty, which suits the large number of predictors (121). Unlike the GLM, the most important predictors are all climate predictors, especially bioclim_11 - “Mean Temperature of Coldest Quarter”.

Ranger

Ranger is a fast implementation of the random forest model. Unlike linear models, random forests have hyperparameters that need to be specified manually. A random forest improves on the accuracy of a single decision tree by fitting many trees, each to a different bootstrap sample of the original dataset, which is called bootstrap aggregation or bagging. In addition, a random forest randomly samples the columns considered at each split. The number of sampled columns is controlled by the hyperparameter mtry. In caret, candidate values are supplied by passing a data.frame to the tuneGrid argument. Again I use all interactions as predictors.
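The bagging idea can be sketched in a few lines of base R. As a deliberately simplified toy example I use lm() as the base learner instead of a decision tree; the bootstrap-and-average logic is the same:

```r
set.seed(1)
n   <- 100
dat <- data.frame(x = runif(n))
dat$y <- 2 * dat$x + rnorm(n, sd = 0.3)

# fit one base learner per bootstrap sample and average the predictions
bagged_predict <- function(dat, newdata, B = 25) {
  preds <- sapply(seq_len(B), function(b) {
    boot <- dat[sample(nrow(dat), replace = TRUE), ]  # bootstrap sample
    fit  <- lm(y ~ x, data = boot)                    # base learner (stand-in for a tree)
    predict(fit, newdata = newdata)
  })
  rowMeans(preds)                                     # the aggregation step of bagging
}

bagged_predict(dat, data.frame(x = c(0.2, 0.8)))
```

Random forest additionally decorrelates the trees by sampling mtry columns at each split, which lm() has no analogue for.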

# fitting ranger

model_ranger <- train(
  SD ~ (.)^2,
  data = pre_training_set[,-c(1,2,31,32)],
  tuneGrid = data.frame(mtry = c(5,15,25,50,100)),
  method = "ranger",
  importance = "impurity",
  preProcess = c("center", "scale", "YeoJohnson", "nzv"),
  na.action = na.omit,
  trControl = trainControl(
    method = "cv",
    verboseIter = F,
    number = 10,
    index = index10,
    allowParallel = F)
)
model_ranger
Random Forest 

15421 samples
   27 predictors

Pre-processing: centered (121), scaled (121), Yeo-Johnson
 transformation (121), remove (257) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 13859, 14109, 13964, 14038, 13764, 13653, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared 
    5   18.91342  0.7930589
   15   18.38847  0.8033572
   25   18.26911  0.8050054
   50   18.21112  0.8054538
  100   18.32113  0.8031135

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 50. 
# variable importance
imp_var_ranger <- varImp(model_ranger)
ggplot(imp_var_ranger, mapping = NULL,
  top = 5, environment = NULL)

Figure 6 - Importance of predictors ranger

# prediction
ranger_predict <- predict(model_ranger, newdata = pre_training_set[,-c(1,2,31,32)])


# add to dataset
fit_project <- bind_cols(fit_project, as.data.frame(ranger_predict))

Figure 7 - parameter mtry and RMSE for ranger

# plot mtry
plot(model_ranger)
#RMSE
round(min(model_ranger$results$RMSE), 2)
[1] 18.21

With 18.21, ranger has the lowest RMSE of all models. The best value of the parameter mtry is 50. The most important predictors are bioclim_11 (“Mean Temperature of Coldest Quarter”) and bioclim_16 (“Precipitation of Wettest Quarter”).

avNNet

The network learns by backpropagation. This involves comparing the output the network produces with the output it was meant to produce, and using the difference between them to modify the weights of the connections between the units in the network, working from the output units through the hidden units to the input units - going backwards, hence the name back-propagation.

The size parameter sets the number of units in the hidden layer. The decay parameter is a weight-decay penalty that shrinks the weights during training and acts as a regulariser against overfitting. Both parameters need to be specified manually. In caret this is again streamlined with a grid search, by passing a custom grid to the tuneGrid argument inside train. avNNet averages the predictions of several neural networks fitted from different random starting weights.
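Reading decay as a penalty is accurate: for a squared-error fit, nnet minimises the error plus a weight-decay term over all network weights \(w_j\), with the decay parameter as \(\lambda\):

```latex
E(w) = \sum_{i=1}^{n}\bigl(\hat{y}_i(w) - y_i\bigr)^2 + \lambda \sum_{j} w_j^{2}
```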

# avNNet with a tuning grid for decay and size
my.grid <- expand.grid(.decay = c(0.1 ,0.01, 0.001), .size = c(5, 10, 15, 20, 27),.bag = F)
model_avNNet <- train(SD ~ .,
                   data = pre_training_set[,-c(1,2,31,32)],
                   method = "avNNet",
                   na.action = na.omit,
                   maxit = 1000,
                   tuneGrid = my.grid,
                   preProcess = c("center", "scale", "YeoJohnson", "nzv"),
                   trace = F,
                   linout = T,
                  trControl = trainControl(
    method = "cv",
    verboseIter = F,
    number = 10,
    index = index10))
# summary
model_avNNet
Model Averaged Neural Network 

15421 samples
   27 predictors

Pre-processing: centered (16), scaled (16), Yeo-Johnson
 transformation (16), remove (11) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 13859, 14109, 13964, 14038, 13764, 13653, ... 
Resampling results across tuning parameters:

  decay  size  RMSE      Rsquared 
  0.001   5    22.43564  0.7156802
  0.001  10    21.33262  0.7334087
  0.001  15    21.25111  0.7346080
  0.001  20    20.90254  0.7484191
  0.001  27    20.71435  0.7500304
  0.010   5    22.58553  0.7120683
  0.010  10    21.32164  0.7331128
  0.010  15    20.94210  0.7390871
  0.010  20    20.95278  0.7442967
  0.010  27    20.51175  0.7533365
  0.100   5    22.65423  0.7086500
  0.100  10    21.32096  0.7331006
  0.100  15    21.08402  0.7368495
  0.100  20    20.77889  0.7483215
  0.100  27    20.72185  0.7475689

Tuning parameter 'bag' was held constant at a value of FALSE
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were size = 27, decay = 0.01 and bag
 = FALSE. 
# variable importance
imp_var_avNNet <- varImp(model_avNNet, scale = T)

ggplot(imp_var_avNNet, mapping = NULL,
  top = 5, environment = NULL)

Figure 7 - Importance of predictors avNNet

# prediction
avNNet_predict <- predict(model_avNNet, newdata = pre_training_set[,-c(1,2,31,32)])

# add to dataset
fit_project <- bind_cols(fit_project, as.data.frame(avNNet_predict))

Figure 8 - Weight and decay parameter of avNNet

# plot 
plot(model_avNNet)
# save all models
save(pre_training_set, model_glm, model_glmnet, model_ranger, model_avNNet, fit_project, file = "models.Rdata")

#RMSE
round(min(model_avNNet$results$RMSE),2)
[1] 20.51

The RMSE of the neural net is 20.51 and therefore better than the GLM and GLMnet, but training also takes much longer. I would try a larger size parameter, but the last run already took almost 10 hours.
For this model the most important predictors are bioclim_16 (“Precipitation of Wettest Quarter”) and bioclim_13 (“Precipitation of Wettest Period”), which are almost the same quantity.

Summary

Figure 9 - RMSE in comparison

# create list of models
models <- list(glm = model_glm, glmnet = model_glmnet, ranger = model_ranger, neural_net = model_avNNet)

# resamples models
resample_models <- resamples(models)

# plot performance measure
library(lattice)
bwplot(resample_models, metric ="RMSE")

In terms of RMSE, ranger is the best model, followed by the neural net, GLMnet and finally the GLM. I find it revealing how the models differ in their assessment of predictor importance. The GLM favoured land use predictors such as “Tropical.Rain.Forest” and has the lowest performance. GLMnet focuses mainly on climate predictors, especially temperature in different variations, and has a slightly better RMSE. The neural net focuses on precipitation predictors and is better still. Finally, ranger favours an interaction between precipitation and temperature and achieves the best result.

Calculate prediction anomaly

The prediction anomaly acts like a bias that I can calculate and use to correct the final prediction of my model. I made predictions on the training set, where I know the actual truth in the form of the number of species per cell (SD). I simply calculate the difference between the predicted and the actual number of species per cell; that is my anomaly. The idea, or hope, is that the difference between the predicted and actual response on the training data is similar to the one on my final prediction data set, where I don't know the truth. The anomaly then serves as a correction.
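A toy numeric sketch (assumed values, not model output): with the anomaly defined as predicted minus observed, applying the correction means subtracting the anomaly from a new prediction.

```r
sd_obs    <- c(100, 50, 10)          # observed species richness (toy values)
sd_fit    <- c(110, 45, 12)          # model fit on the training data (toy values)
anomaly   <- sd_fit - sd_obs         # anomaly = predicted - observed
sd_future <- c(90, 55, 8)            # raw future prediction (toy values)
sd_corrected <- sd_future - anomaly  # subtract the estimated bias
sd_corrected
# [1] 80 60  6
```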

# clear workspace
rm(list = ls())

# load model data
load("models.Rdata")
library(dplyr)
# convert data.frame to tbl
pre_training_set_tbl <- tbl_df(pre_training_set)
fit_project_tbl <- tbl_df(fit_project)

# select model results
anomaly_results <- bind_cols(dplyr::select(pre_training_set_tbl, x, y, SD), dplyr::select(fit_project_tbl, glmnet_predict, glm_predict, ranger_predict, avNNet_predict))

# add anomalies (predicted minus actual species per cell)
anomaly_results <- mutate(anomaly_results, anomaly_glmnet = glmnet_predict - SD, anomaly_glm = glm_predict - SD, anomaly_ranger = ranger_predict - SD, anomlay_AvNNet = avNNet_predict - SD)

# save file
save(anomaly_results, file = "anomaly.Rdata")
rm(list = ls())

Visualising the anomaly reveals where the strongest errors occur and in which direction: negative values mean the model predicts fewer species than actually observed, positive values mean it predicts more.

# load anomaly data
load("anomaly.Rdata")

library(raster)
# anomaly glm
anomaly_glm_ras <- rasterFromXYZ(anomaly_results[,c(1,2,9)], crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# anomaly glmnet
anomaly_glmnet_ras <- rasterFromXYZ(anomaly_results[,c(1,2,8)], crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# anomaly ranger
anomaly_ranger_ras <- rasterFromXYZ(anomaly_results[,c(1,2,10)], crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# anomaly AvNNet
anomaly_AvNNEt_ras <- rasterFromXYZ(anomaly_results[,c(1,2,11)], crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# build a raster stack
s <- stack(anomaly_glmnet_ras, anomaly_glm_ras, anomaly_ranger_ras, anomaly_AvNNEt_ras)

# spatial grid data frame for plot
sp <- as(s, 'SpatialGridDataFrame')

# make color palette
library(RColorBrewer)
my_palette <- brewer.pal(n = 5, name = "BrBG")

# plot anomalies
spplot(sp, col.regions = my_palette, cuts = 4)

Figure 10 - Spatial distribution of prediction anomaly

Figure 10 shows the spatial distribution of the prediction anomaly for all models. The models struggle in different ways to map the biodiversity hotspots, where most errors occur. The earlier ranking by RMSE is clearly visible. With the exception of ranger, however, the models show a similar spatial pattern in their errors. The north of South America is overestimated, while the south is underestimated. Australia seems to be consistently overestimated, just like New Zealand, Indonesia and Madagascar. Central Africa is underestimated.
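The overall direction of these errors can also be quantified. A quick check, assuming the `anomaly_results` object created above (note the column name `anomlay_AvNNet` as spelled in the code):

```r
library(dplyr)
load("anomaly.Rdata")

# mean anomaly per model: the sign gives the overall direction
# (negative = underestimation), the magnitude the average bias
summarise(anomaly_results,
          mean_glm    = mean(anomaly_glm),
          mean_glmnet = mean(anomaly_glmnet),
          mean_ranger = mean(anomaly_ranger),
          mean_avNNet = mean(anomlay_AvNNet))
```

Filtering `anomaly_results` by x and y ranges before summarising would give the same numbers per region, e.g. for South America only.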

Predictions

Finally, the models predict on the hypothetical dataset created from the RCP6.0 scenario of the IPCC for the period 2050-2079. I use the anomalies for correction (subtract the anomaly from the prediction).

rm(list = ls())
# load data and models
load("all.data_future.Rdata")
load("models.Rdata")
load("anomaly.Rdata")

# prepare data
# kick NAs
predict_set <- na.omit(all.data_future)

# kick SD
predict_set <- predict_set[,-3]

# kick out last row to match with training set
predict_set <- predict_set[-15422,]

# predictions glmnet
predict_glmnet_final <- predict(model_glmnet, newdata = predict_set[,-c(1,2)])

# create final results dataframe
final <- as.data.frame(predict_glmnet_final)

# predictions glm
final$predict_glm_final <- predict(model_glm, newdata = predict_set[,-c(1,2)])

# predictions ranger
final$predict_ranger_final <- predict(model_ranger, newdata = predict_set[,-c(1,2)])

# predictions AvNNet
final$predict_AvNNet_final <- predict(model_avNNet, newdata = predict_set[,-c(1,2)])

# create results dataset with projections and anomaly
library(dplyr)
results <- bind_cols(dplyr::select(anomaly_results, x, y, SD, anomaly_glmnet, anomaly_glm, anomaly_ranger, anomlay_AvNNet), final)

# create corrected predictions and change compared to today
results <- mutate(results, pred_glmnet_corr = predict_glmnet_final - anomaly_glmnet, pred_glm_corr = predict_glm_final - anomaly_glm, pred_ranger_corr = predict_ranger_final - anomaly_ranger, pred_AvNNet_corr = predict_AvNNet_final - anomlay_AvNNet, change_glmnet = pred_glmnet_corr - SD, change_glm = pred_glm_corr - SD, change_ranger = pred_ranger_corr - SD, change_avNNet = pred_AvNNet_corr - SD)

save(results, file = "results.Rdata")
rm(list = ls())

Results

The results are presented as an interactive map (Figure 11) of the actual species richness per cell, the predictions and the change compared to today for every model. In the upper left, you can select the layers. First unselect all layers, choose a base map of your liking and explore the data one layer at a time. Don’t forget to zoom to a location of your interest.

# load model predictions
load("results.Rdata")

# load species distribution data
load("speciesRichness1.Rdata")

# convert to raster files 
library(raster)
# species distribution recent
speciesRichness_today <-rasterFromXYZ(speciesRichness1, crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0") 

# glmnet
prediction_glmnet <- rasterFromXYZ(results[,c(1,2,12)], crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# glm
prediction_glm <- rasterFromXYZ(results[,c(1,2,13)], crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# ranger
prediction_ranger <- rasterFromXYZ(results[,c(1,2,14)], crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# AvNNet
prediction_AvNNet <- rasterFromXYZ(results[,c(1,2,15)], crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# change glmnet
change_glmnet <- rasterFromXYZ(results[,c(1,2,16)], crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# change glm
change_glm <- rasterFromXYZ(results[,c(1,2,17)], crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# change ranger
change_ranger <- rasterFromXYZ(results[,c(1,2,18)], crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# change AvNNet
change_AvNNEt <- rasterFromXYZ(results[,c(1,2,19)], crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")

# create own color palette
library(RColorBrewer)
# finding color palette
# display.brewer.all()
# take "Greens"
greens <- brewer.pal(5,"Greens")
div_red_green <- brewer.pal(7,"BrBG")
# display.brewer.pal(7,"BrBG")
# display.brewer.pal(5,"Greens")

# tmap
library("tmap")
sd_map <- tm_shape(speciesRichness_today, projection = "+proj=longlat +ellps=WGS84  +datum=WGS84 +no_defs +towgs84=0,0,0") +
          tm_raster(palette = greens, title = "Today", breaks = c(0,50,100,150,200,250,300), legend.show = F) +
  
  tm_shape(prediction_glmnet, projection = "+proj=longlat +ellps=WGS84  +datum=WGS84 +no_defs +towgs84=0,0,0") +
  tm_raster(palette = greens, title = "Prediction", breaks = c(0,50,100,150,200,250,300), legend.show = T) +
  
  tm_shape(prediction_glm, projection = "+proj=longlat +ellps=WGS84  +datum=WGS84 +no_defs +towgs84=0,0,0") +
  tm_raster(palette = greens, title = "species per cell", breaks = c(0,50,100,150,200,250,300), legend.show = F) +
  
  tm_shape(prediction_ranger, projection = "+proj=longlat +ellps=WGS84  +datum=WGS84 +no_defs +towgs84=0,0,0") +
  tm_raster(palette = greens, title = "species per cell", breaks = c(0,50,100,150,200,250,300), legend.show = F) +
  
  tm_shape(prediction_AvNNet, projection = "+proj=longlat +ellps=WGS84  +datum=WGS84 +no_defs +towgs84=0,0,0") +
  tm_raster(palette = greens, title = "species per cell", breaks = c(0,50,100,150,200,250,300), legend.show = F) +
  
  tm_shape(change_glmnet, projection = "+proj=longlat +ellps=WGS84  +datum=WGS84 +no_defs +towgs84=0,0,0") +
  tm_raster(palette = div_red_green, breaks = c(-150,-100,-50,0,50,100,150), legend.show = T, title = "Change") +
  
  tm_shape(change_glm, projection = "+proj=longlat +ellps=WGS84  +datum=WGS84 +no_defs +towgs84=0,0,0") +
  tm_raster(palette = div_red_green, breaks = c(-150,-100,-50,0,50,100,150), legend.show = F) +
  
  tm_shape(change_ranger, projection = "+proj=longlat +ellps=WGS84  +datum=WGS84 +no_defs +towgs84=0,0,0") +
  tm_raster(palette = div_red_green, breaks = c(-150,-100,-50,0,50,100,150), legend.show = F) +
  
  tm_shape(change_AvNNEt, projection = "+proj=longlat +ellps=WGS84  +datum=WGS84 +no_defs +towgs84=0,0,0") +
  tm_raster(palette = div_red_green, breaks = c(-150,-100,-50,0,50,100,150), legend.show = F) 
  
# convert to an interactive map by using leaflet package and base map
lf <- tmap_leaflet(sd_map)
lf

Figure 11 - Predictions of all models and changes compared to today


All models mostly predict a decline of 50 to 100 species per cell. Especially the regions with the highest biodiversity are affected and suffer the highest losses. This general trend does not apply to the west of South America, Madagascar, north Australia, New Zealand and east Indonesia. The most accurate model in training, ranger, predicts less extreme changes than the others.
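The claimed magnitude of the decline can be checked against the results table itself. A short sketch using the change columns created above:

```r
library(dplyr)
load("results.Rdata")

# average change in species per cell for each model;
# negative values confirm the overall decline
summarise(results,
          mean_change_glmnet = mean(change_glmnet),
          mean_change_glm    = mean(change_glm),
          mean_change_ranger = mean(change_ranger),
          mean_change_avNNet = mean(change_avNNet))
```

A `hist()` of the individual change columns would additionally show how much of the decline falls in the 50 to 100 species range.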


Outlook

The first aim would be to better understand the models. The prediction anomalies already pointed out that most models have a significant error. It would be important to further investigate its sources. Globally that is difficult because of the spatial variation; analyses on a smaller scale, with a later comparison, would be more revealing, I think. The focus would be on the interaction between predictors and model architecture, to better understand the strengths and weaknesses of the particular models. Once the models are better understood, it would be interesting to use caretEnsemble to repeat and compare the prediction. Finally, it is important to measure the uncertainty of the predictions, which is normally done by bootstrapping. This is challenging because the size of the data makes the models slow, so it would be necessary to speed them up drastically. Doing the calculations in parallel would be one option.
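For the parallelisation, caret integrates with the foreach backends out of the box. One possible setup, using the doParallel package (the worker count here is only an example):

```r
library(doParallel)

# start a cluster with one worker per core, leaving one core free
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

# caret::train automatically distributes its resampling
# iterations (and bootstrap replicates) across the workers,
# so the model fits below run in parallel without code changes
# model_ranger <- train(SD ~ ., data = pre_training_set, method = "ranger", ...)

# release the workers when done
stopCluster(cl)
```

This would speed up the resampling and any bootstrap-based uncertainty estimate roughly in proportion to the number of cores, at the cost of higher memory use.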